Simple floating point operations like addition or multiplication on normalized floating point
values can be computed by current AMD and Intel processors in three to five cycles. This is
different for denormalized numbers, which appear when an underflow occurs and the value can
no longer be represented as a normalized floating-point value. Here the costs are about two
orders of magnitude larger.

1 Introduction 

Simple floating point operations like addition or multiplication on normalized floating point
values can be computed by current AMD and Intel processors in three to five cycles.
This is different for denormalized numbers, which appear when an underflow occurs and the
value can no longer be represented as a normalized floating-point value. Here the costs are
about two orders of magnitude larger. Often this goes unnoticed, because gradual underflow is
normally avoided by configuring the floating point units to treat underflowed values as zero,
as described in section 2.

The objective of this short report is to quantify the performance impact on floating point
operations when denormalized/NaN values, overflows, or divisions by zero occur. The focus
is restricted to

• double precision floating point addition, multiplication, division and fused-multiply-add 

• with the AVX, AVX2, and FMA3/4 ISA extensions 

for the x86-64 architecture. Single precision, x87, SSE, the influence of different rounding 
modes, etc. are not considered. 

2 Flush-to-Zero and Denormals-are-Zero 

The SSE/AVX floating point units of the current x86-64 architecture support two complementary
modes for avoiding the enormous costs of gradual underflow:

• DAZ: Denormalized values of input operands are treated as zero, which is called
denormals are zero.
• FTZ: With flush to zero (FTZ) the denormalized results of floating point operations are set
to zero.

Both options are controlled through specific bits in the floating point control register MXCSR:
bit 15 enables FTZ and bit 6 enables DAZ. Additionally, for FTZ it makes sense to mask
underflow exceptions through bit 11. MXCSR is manipulated via the LDMXCSR and STMXCSR
instructions or their intrinsic equivalents _mm_setcsr() and _mm_getcsr(). Intel provides
some more details [4]. By default GCC and the Intel compiler insert code to use FTZ and DAZ,
which can be altered via parameters. This is described in the corresponding compiler
documentation.
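The bit manipulation described above can be sketched in C with the two intrinsics. This is a minimal illustration under stated assumptions, not the code used for the measurements: the macro names and the helper functions enable_ftz_daz(), disable_ftz_daz(), and product_at_runtime() are hypothetical, and only the bit positions 6, 11, and 15 are taken from the text.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

/* MXCSR bit positions as given in the text (macro names are illustrative). */
#define MXCSR_DAZ            (1u << 6)   /* denormals-are-zero       */
#define MXCSR_UNDERFLOW_MASK (1u << 11)  /* mask underflow exception */
#define MXCSR_FTZ            (1u << 15)  /* flush-to-zero            */

/* Enable FTZ and DAZ for the calling thread and mask underflow exceptions. */
static void enable_ftz_daz(void)
{
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ | MXCSR_UNDERFLOW_MASK);
}

/* Disable FTZ and DAZ again, which restores gradual underflow. */
static void disable_ftz_daz(void)
{
    _mm_setcsr(_mm_getcsr() & ~(MXCSR_FTZ | MXCSR_DAZ));
}

/* Multiply two doubles at run time. The volatile temporaries keep the
   compiler from folding the product at compile time, so the MXCSR
   setting that is active during the call decides the result. */
static double product_at_runtime(double x, double y)
{
    volatile double a = x;
    volatile double b = y;
    volatile double r = a * b;
    return r;
}
```

With gradual underflow active, product_at_runtime(DBL_MIN, 0.5) from <float.h> yields a denormalized value; after enable_ftz_daz() the same call returns zero. MXCSR is a per-thread register, so each thread has to apply the setting itself.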


3 Benchmarked Systems 

For measuring the duration of floating point operations three Intel-based systems and one
AMD-based system were used. The Intel systems are based on the three microarchitectures
SandyBridge, IvyBridge, and Haswell. From AMD only the older Bulldozer-based Interlagos was
available. Table 1 gives a short overview of the systems’ parameters. Instruction throughput
and latency numbers are taken from the vendors [2, 5] and Fog [3]. Throughput describes how
many independent instructions of a certain type can be issued per cycle. Latency, on the other
hand, denotes the duration of the execution of an instruction in cycles.

On all Intel-based processors each core has a separate multiplication and addition unit. This
enables them to execute these two operations in parallel. Each Haswell core has an additional
multiplication unit, located on the same port as the add unit. Either two multiplications or one
addition and one multiplication can be executed concurrently. Additionally, two 256-bit wide
FMA units are available [5], sharing the same ports as the multiplication and addition units.

An Interlagos floating point module, in contrast, is shared by two adjacent cores and has two
128-bit wide fused-multiply-add (FMA) units [2]. On each cycle they can receive an AVX
multiplication, addition, division, or FMA from either of the two cores.

The FMA support from Intel and AMD differs in its implementation. Intel uses a three
operand destructive form (FMA3): a = a × b + c. AMD on the other hand uses four operands
(FMA4) where the source operand is not overwritten: a = b × c + d.

4 Micro-Benchmarks 

For benchmarking, small micro-benchmarks were used which perform the following operations on
double precision floating point vectors:


    Addition:        a(:) = b(:) + c(:)
    Multiplication:  a(:) = b(:) × c(:)
    Division:        a(:) = b(:) / c(:)
    FMA:             a(:) = b(:) × c(:) + d(:)


Different types of input values are tested. Firstly the vectors are initialized in such a way that
the results of the computations are normalized values. Further input values are chosen which
provoke underflow, overflow, or division by zero. Finally it is tested how the duration of the
operation is influenced if not-a-number (NaN) values are used as input operands. The operations
are implemented as two nested loops over the vectors to ensure the duration of the benchmark
is long enough:
                                   SandyBridge     IvyBridge      Haswell            Interlagos
Type                               Intel           Intel          Intel              AMD
                                   Xeon E5-2680    E5-2660 v2     E5-2695 v3         Opteron 6276
Frequency             [GHz]        2.7             2.2            2.3                2.3
Cores                              8               10             12                 16
ISA                                AVX             AVX            AVX, AVX2, FMA3    AVX, FMA4
AVX Addition          per cy       1               1              1                  1
AVX Multiplication    per cy       1               1              2                  1
AVX Add/Mul           per cy       1/1             1/1            1/1, 0/2           1/0, 0/1
AVX Addition
  Throughput          per cy       1               1              1                  1
  Latency             [cy]         3               3              3                  6
AVX Multiplication
  Throughput          per cy       1               1              2                  1
  Latency             [cy]         5               5              5                  6
AVX Division
  Throughput          per cy       0.025 (a)       0.04 (a)       0.04 (a)           0.03-0.11 (b)
  Latency             [cy]         21-45 (b)       20-35 (b)      19-35 (b)          27
FMA (256-bit wide)
  Throughput          per cy       -               -              2                  1
  Latency             [cy]         -               -              5

Table 1: Relevant architectural characteristics of the evaluated systems. If not otherwise noted, instruction
throughput and latency numbers are taken from [2, 3, 5].

(a) Measured with micro-benchmark.

(b) Taken from Fog [3].


    for (int n = 0; n < repetitions; ++n) {
        for (int i = 0; i < vectorLength; ++i) {
            // benchmark operation on vector element i
        }
    }


With AVX the innermost loop gets vectorized, so that during one AVX iteration four scalar 
iterations are performed at the same time. 
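How input values can provoke the different result classes is sketched below. The concrete values are illustrative assumptions and not the ones used for the measurements, and the names classify(), fill(), multiply(), and divide() are hypothetical helpers. The sketch assumes that FTZ and DAZ are disabled, so that an underflowing product is visible as a denormalized value.

```c
#include <float.h>  /* DBL_MIN, DBL_MAX */

/* Result classes distinguished in the benchmarks. */
enum result_class { RES_NORMAL, RES_ZERO, RES_DENORMAL, RES_INFINITE, RES_NAN };

/* Classify a double using comparisons only. */
static enum result_class classify(double x)
{
    if (x != x) return RES_NAN;                  /* NaN never equals itself      */
    if (x > DBL_MAX || x < -DBL_MAX) return RES_INFINITE;
    if (x == 0.0) return RES_ZERO;
    if (x < DBL_MIN && x > -DBL_MIN) return RES_DENORMAL; /* below smallest normal */
    return RES_NORMAL;
}

/* Set every element of v to the same value. */
static void fill(double *v, int n, double value)
{
    for (int i = 0; i < n; ++i)
        v[i] = value;
}

/* a(:) = b(:) x c(:) */
static void multiply(double *a, const double *b, const double *c, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * c[i];
}

/* a(:) = b(:) / c(:) */
static void divide(double *a, const double *b, const double *c, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] / c[i];
}
```

For example, filling b with DBL_MIN and c with 0.5 makes every product underflow, DBL_MAX times 2.0 overflows to infinity, and 1.0 divided by 0.0 is a division by zero.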

Implementing these operations in C/C++ or Fortran requires, besides executing the
computations themselves, loading and storing the involved vectors. This introduces a bottleneck
even when the vectors reside in the cores’ L1 cache, so the full floating point performance
will not be visible. With the shown vector operations all iterations over the vectors are
independent, and prefetching, as well as the out-of-order engine, can work perfectly. Thus for
computing the resulting performance only the throughput of the instructions is relevant and
latencies can be ignored, assuming data resides in the L1 cache. The SandyBridge and IvyBridge
systems have the following properties:

• 1 cy for full AVX load, 

• 2 cy for full AVX store, 

• 1 cy for AVX multiplication, 

• 1 cy for AVX addition. 
To perform one AVX iteration, i.e. four scalar iterations, the multiplication and add benchmarks
each require

• two AVX loads, 

• one AVX store, 

• one multiplication/addition. 

All evaluated processors are superscalar and can execute load, store, and arithmetic instructions
concurrently. Under these assumptions one AVX iteration takes 2 cy, as it is limited by the single
AVX store and the two AVX loads. On average a single addition or multiplication of
the corresponding AVX version therefore takes 0.5 cy. This can be seen in Tab. 2 for the
corresponding architectures when the C kernel is used.

The addition and multiplication units, however, have a throughput of 1 AVX
addition/multiplication per cycle, which results in 0.25 cy per single operation. This limit can
only be reached if the bottleneck is removed and the code is no longer load and store bound.

By explicitly implementing these benchmarks in assembly this problem can be avoided. The
vector length is chosen short enough that all operands can be kept in registers, so that
additional loads or stores from and to the cache are no longer required. With these benchmarks
the full throughput of 0.25 cy is achieved, as reported in Tab. 2 for normalized numbers with the
ASM kernel.
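The throughput estimate for both kernels can be written down as a small model. This is a sketch under the assumptions stated in the text: loads, stores, and arithmetic proceed concurrently, so the busiest resource determines the duration of one AVX iteration. The struct and function names are hypothetical.

```c
/* Issue cost in cycles of one AVX instruction of each kind,
   as listed for SandyBridge and IvyBridge. */
struct issue_costs {
    double load_cy;   /* full AVX load  */
    double store_cy;  /* full AVX store */
    double arith_cy;  /* AVX add or mul */
};

/* Cycles per scalar operation for one AVX iteration that issues the given
   number of loads, stores, and arithmetic instructions on `lanes` doubles.
   The three resources work in parallel, so the slowest one is the bound. */
static double cycles_per_scalar_op(struct issue_costs c,
                                   int loads, int stores, int arith, int lanes)
{
    double bound = loads * c.load_cy;
    if (stores * c.store_cy > bound) bound = stores * c.store_cy;
    if (arith * c.arith_cy > bound)  bound = arith * c.arith_cy;
    return bound / lanes;
}
```

With costs of 1, 2, and 1 cy, the C kernel (two loads, one store, one arithmetic instruction, four lanes) gives 2 cy / 4 = 0.5 cy per operation, while the register-resident assembly kernel (no loads or stores) gives 1 cy / 4 = 0.25 cy.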

With these adjustments, the body of the innermost loop of the multiplication benchmark, for
example, looks as follows (Intel syntax):

    vmulpd ymm9,  ymm1, ymm5
    vmulpd ymm10, ymm2, ymm6
    vmulpd ymm11, ymm3, ymm7
    vmulpd ymm12, ymm4, ymm8

Here a vector length of 16 was chosen. The registers YMM1-YMM8 are initialized with the
values of the input vectors b and c before the loop is entered and are then reused. It is
important to note that the innermost loop must now be unrolled often enough to hide the latency
of the benchmarked operations. For the multiplication this is 5 cy, as there exists no dependency
between the target registers. An unroll factor ≥ 4 hides this latency, as shown in the previous
code snippet.

For the benchmark it is assumed that the execution units are not able to cache previously
computed values across iterations of the innermost loop and thus cannot take a shortcut when
the same operands appear again.

All other benchmarks are implemented accordingly using the AVX instruction set. Additionally,
for the FMA benchmark on Haswell and Interlagos, FMA3 and FMA4 were used, respectively.


5 Results 

All micro-benchmarks were executed with FTZ and DAZ enabled and disabled. The results are
shown in Tab. 2. The reported values specify the duration of a single floating-point operation in
cycles for the specific operation like addition, multiplication, division, or fused-multiply-add.
The visible duration of the full AVX or FMA instruction is four times the reported number. The
input values for the micro-benchmarks were adjusted to generate normalized values,
underflows, overflows, or divisions-by-zero as results. Furthermore the impact of denormalized
and NaN values as input operands is evaluated. As already mentioned, the full throughput for
addition, multiplication, division-by-zero, and FMA is only reached when utilizing the assembly
version of the benchmark (ASM) instead of the C implementation.
With FTZ and DAZ enabled (“F+D” columns) the measured durations of the specific
operations are within the documented ranges. Disabling FTZ and DAZ (“No F+D” columns), as
expected, does not increase the costs for operations with normalized input and output values.

Addition: Without FTZ and DAZ the duration of the additions is independent of overflows,
denormalized input values, and NaNs. Additions are only sensitive to underflows, which then
take around 36-38 cy. Despite these high values, it is evident that the handling of this case
has been improved throughout the development of Intel’s microarchitectures. The duration of
an underflowing addition was reduced from 38.20 cy (SandyBridge), over 37.70 cy (IvyBridge)
to 31.90 cy (Haswell).

Multiplication: Multiplications become expensive if FTZ and DAZ are disabled and
either an underflow occurs or the input operands contain denormalized values. On the Intel
processors, if both input operands, i.e. the multiplicand and the multiplier, are denormals, then
the duration is 0.25 cy, the same as with normalized values. As in the case of the addition, an
overflow or NaN input values introduce no extra cost.

Division: The duration of a division ranges from 7 to 10 cy on the Intel architectures and
requires 5 cy on Interlagos for normalized input values. With enabled FTZ and DAZ overflows
and underflows do not introduce additional costs. Division-by-zero and denormalized input
values seem to be detected at an early stage. Their duration is only half that of a division
with normalized operands.

With disabled FTZ and DAZ overflows have no impact on the instruction duration. In
contrast, an underflow in the division takes 71 cy (SNB), 63 cy (IVB), and 57 cy (HSW) compared
to approximately 41 cy on the AMD system. Denormalized input operands always incur a
penalty, except on the AMD system, where only a denormalized dividend is expensive.

FMA: According to IEEE 754-2008 [1] the fused-multiply-add operation should compute
b × c + d as if with infinite precision and round only once at the end. Haswell with FMA3 and
Interlagos with FMA4 both show an interesting behavior when an underflow occurs in the
multiplication of the FMA and FTZ and DAZ are disabled. An underflow with a pure AVX
multiplication instruction (FTZ and DAZ disabled) costs 33 cy (Haswell) and 37 cy (Interlagos),
whereas no penalty is measured when this occurs with the FMA instructions. In contrast, an
underflowing addition in FMA with disabled FTZ and DAZ is time-consuming.

6 Conclusion 

Floating point operations like addition and multiplication with normalized input and output
values are handled in three to five cycles. With enabled flush-to-zero (FTZ) and
denormals-are-zero (DAZ), which is the default case for GCC and the Intel compiler if not
otherwise specified, underflow, overflow, NaNs, and divisions-by-zero have no negative
performance impact.

If, however, the additional precision gained by gradual underflow is required, FTZ and DAZ
must be disabled. The costs for underflowing operations are then about two orders of magnitude
higher than for the normalized operations for AVX addition, multiplication, and division. In the
case of FMA only an underflow during the addition is costly. An underflowing multiplication
within FMA introduces no additional costs.
Table 2: Duration of a single floating-point operation in cycles for specific AVX floating point benchmarks with different values of the input operands. The
visible duration of the full AVX instruction is four times the reported number, which is also the inverse throughput. Measurements were obtained
with FTZ and DAZ enabled and disabled, denoted as F+D and No F+D, respectively. In the assembly benchmarks (ASM) the vectors were kept in
registers to avoid the load/store bottleneck.